This report presents a machine learning analysis of the provided dataset, broken into several sections. First, the data will be loaded for cleaning and exploration. Various methods will be employed to clean and transform the data, making it suitable for efficient and effective use in machine learning models. Then, unsupervised learning methods will be applied: clustering, t-SNE, and PCA. These will be used to gain insight into how the data is grouped, discover potential patterns, and assess the potential for dimensionality reduction. Finally, supervised learning methods will be used: logistic regression, random forest, and XGBoost. These models will be trained on a binary classification task to predict the Diagnosis column within the dataset, and SHAP values will then be used to explain the predictive output of the best performing model, to draw further insights. At the end, conclusions will be drawn about the dataset and the effectiveness of each stage: which supervised model performed best, how suitable the dataset is for unsupervised methods, and the potential for further work and investigation. The limitations of the work conducted will also be discussed in that context.
The code block below imports all of the packages used throughout this report:
# To load and manipulate the dataset
import pandas as pd
import numpy as np
from scipy.stats import normaltest, skew
# For data visualisation
import matplotlib.pyplot as plt
import seaborn as sns
# Statistical tests and transformations
from scipy.stats import boxcox, shapiro
# For Imputation and Scaling
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler
# For Clustering
import plotly.express as px
from sklearn.metrics import adjusted_rand_score
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.pipeline import make_pipeline
# For tSNE and PCA
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D
# Libraries for Supervised learning and SHAP
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, cross_val_predict, GridSearchCV, StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, accuracy_score, roc_curve
import xgboost as xgb
import shap
# For Table Creation
from tabulate import tabulate
The dataset is provided as a CSV file, so the first step is to load it into a pandas DataFrame so it can be easily used. Then, an initial check of all the columns is run for both missing values and duplicate rows. Missing values are important to identify because they will need to be handled in some way before many machine learning models can be trained. Duplicate rows are important to identify because they should be removed from the dataset to avoid bias. Running the code shows that the variables "Suspicious", "Pregnant", and "Intrusive_Thoughts" have missing values, and that there are no duplicate rows within the dataset.
# Loading the dataset into a pandas DataFrame
data = pd.read_csv('./MS4H03_Dataset.csv')
# Checking the data for missing values
print(data.isnull().sum().sort_values(ascending=False))
# Checking the data for duplicate rows
duplicate = data[data.duplicated()]
print("Duplicate Rows :")
print(duplicate)
Suspicious            2873
Pregnant              2238
Intrusive_Thoughts     830
Diagnosis                0
Passive                  0
Unusual_Thought          0
Tired                    0
Tension                  0
Stress                   0
Sleep                    0
Sex                      0
Rumination               0
Race                     0
Psychomotor              0
Participant              0
Anhedonia                0
Housing                  0
Hallucination            0
Focus                    0
Dep_Mood                 0
Delusion                 0
Delay                    0
Content                  0
Concentration            0
Appetite                 0
Apathy                   0
Withdrawal               0
dtype: int64
Duplicate Rows :
The next step is to view the head of the data, which shows the first 5 rows of the dataset. This gives a sense of what each column looks like, helping to identify potential problems or trends early. For example, within the head, missing values for Suspicious can be seen, as well as some categorical variables which will need to be encoded later on.
# Shows the first 5 entries in the dataset
data.head()
| | Diagnosis | Anhedonia | Apathy | Appetite | Concentration | Content | Delay | Delusion | Dep_Mood | Focus | ... | Race | Rumination | Sex | Sleep | Stress | Suspicious | Tension | Tired | Unusual_Thought | Withdrawal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 4.501446 | 3.056575 | 9.283891 | 8.305502 | 0.006142 | Yes | 1.170842 | 8.177884 | 8.305502 | ... | White | 5.041432 | Female | 6.552082 | 5.915492 | 3.991749 | 6.371877 | 4.537270 | 3.744410 | 5.242819 |
| 1 | 0 | 5.432608 | 0.307721 | 25.809400 | 5.060206 | 0.062209 | Yes | 2.272548 | 11.162913 | 5.060206 | ... | White | 4.656790 | Male | 5.558434 | 4.008265 | NaN | 2.238665 | 6.113746 | 0.720458 | 3.797242 |
| 2 | 0 | 6.557500 | -1.412208 | 24.842012 | 6.504229 | 0.071532 | No | 1.439095 | 5.887363 | 6.504229 | ... | Black | 7.290494 | Female | 5.787561 | 7.177926 | NaN | 6.992499 | 5.557374 | -0.491255 | 4.421288 |
| 3 | 0 | 5.429568 | 1.686157 | 24.175853 | 6.994948 | 0.278345 | No | 1.380185 | 8.943851 | 6.994948 | ... | White | 6.759339 | Female | 6.787287 | 2.866815 | NaN | 4.018286 | 6.136269 | 0.765388 | 2.299562 |
| 4 | 1 | 5.099846 | -0.612506 | 45.808490 | 5.869212 | 0.004214 | Yes | 1.237976 | 6.741627 | 5.869212 | ... | Black | 6.190019 | Male | 7.748312 | 6.798220 | 5.103103 | 5.063542 | 4.213723 | 0.705895 | 6.547707 |
5 rows × 27 columns
Then the info for the dataset is viewed to provide a quick overview of each column. This shows that there are 27 columns: 21 floats, 2 ints, and 4 object dtypes. The object columns are Delay, Housing, Race, and Sex; these will all need to be encoded before being used in machine learning models.
#Provides information about the dataset
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 27 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Diagnosis           5000 non-null   int64
 1   Anhedonia           5000 non-null   float64
 2   Apathy              5000 non-null   float64
 3   Appetite            5000 non-null   float64
 4   Concentration       5000 non-null   float64
 5   Content             5000 non-null   float64
 6   Delay               5000 non-null   object
 7   Delusion            5000 non-null   float64
 8   Dep_Mood            5000 non-null   float64
 9   Focus               5000 non-null   float64
 10  Hallucination       5000 non-null   float64
 11  Housing             5000 non-null   object
 12  Intrusive_Thoughts  4170 non-null   float64
 13  Participant         5000 non-null   int64
 14  Passive             5000 non-null   float64
 15  Pregnant            2762 non-null   float64
 16  Psychomotor         5000 non-null   float64
 17  Race                5000 non-null   object
 18  Rumination          5000 non-null   float64
 19  Sex                 5000 non-null   object
 20  Sleep               5000 non-null   float64
 21  Stress              5000 non-null   float64
 22  Suspicious          2127 non-null   float64
 23  Tension             5000 non-null   float64
 24  Tired               5000 non-null   float64
 25  Unusual_Thought     5000 non-null   float64
 26  Withdrawal          5000 non-null   float64
dtypes: float64(21), int64(2), object(4)
memory usage: 1.0+ MB
A description of the dataset is then produced to provide summary statistics for each column. This reveals some interesting insights. For example, Content has a very wide range, with a minimum value of 0.000187 and a maximum of 21.0. Hallucination has an even greater range, with a minimum of 0.027 and a maximum of 6287. This will need to be considered later on. Another notable result is that the Tired column contains infinite values, as its mean registers as inf. Other interesting aspects of the dataset will be discussed below when cleaning the data.
# Provides summary statistics of the columns in the dataset
data.describe()
| | Diagnosis | Anhedonia | Apathy | Appetite | Concentration | Content | Delusion | Dep_Mood | Focus | Hallucination | ... | Pregnant | Psychomotor | Rumination | Sleep | Stress | Suspicious | Tension | Tired | Unusual_Thought | Withdrawal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | ... | 2762.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 2127.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 |
| mean | 0.505800 | 6.502860 | 2.478849 | 27.070029 | 6.519924 | 0.279407 | 2.637389 | 5.727062 | 6.519924 | 65.073832 | ... | 0.101376 | 4.680583 | 5.685816 | 7.011266 | 4.916418 | 2.754909 | 4.920667 | inf | 2.480266 | 3.958121 |
| std | 0.500016 | 1.488151 | 1.730810 | 14.202618 | 1.474846 | 0.834494 | 1.440347 | 3.284501 | 1.474846 | 223.943469 | ... | 0.301880 | 1.482692 | 2.161891 | 1.410841 | 2.220262 | 1.496126 | 1.962342 | NaN | 1.410475 | 1.469444 |
| min | 0.000000 | 1.098854 | -3.211011 | 0.141074 | 1.299964 | 0.000187 | -2.127037 | 0.000000 | 1.299964 | 0.027350 | ... | 0.000000 | -0.024974 | -0.409032 | 2.144726 | -3.257788 | -2.346238 | -2.183456 | 0.366650 | -1.981307 | -0.825919 |
| 25% | 0.000000 | 5.495361 | 1.265128 | 16.724108 | 5.528181 | 0.018655 | 1.629919 | 4.678095 | 5.528181 | 4.113962 | ... | 0.000000 | 3.697870 | 4.042552 | 6.058402 | 3.443683 | 1.703462 | 3.565482 | 4.491580 | 1.486439 | 2.969534 |
| 50% | 1.000000 | 6.485527 | 2.427409 | 25.165292 | 6.498042 | 0.064259 | 2.558146 | 6.752196 | 6.498042 | 12.764604 | ... | 0.000000 | 4.720156 | 5.521805 | 6.980519 | 5.096416 | 2.735108 | 5.247353 | 5.513508 | 2.388994 | 3.962131 |
| 75% | 1.000000 | 7.489218 | 3.642059 | 35.447666 | 7.519759 | 0.215773 | 3.588012 | 8.045706 | 7.519759 | 41.814204 | ... | 0.000000 | 5.682627 | 7.276673 | 7.977138 | 6.531673 | 3.725759 | 6.385145 | 6.569176 | 3.426667 | 4.972302 |
| max | 1.000000 | 11.603140 | 8.803433 | 113.438734 | 11.649649 | 21.001327 | 8.978785 | 12.003550 | 11.649649 | 6287.163151 | ... | 1.000000 | 10.171540 | 12.009666 | 11.920312 | 11.970952 | 8.212275 | 9.622076 | inf | 8.066822 | 9.022207 |
8 rows × 23 columns
Now that a basic understanding of the dataset has been achieved, the next step is to start cleaning and preprocessing the data so that it may be used in model creation down the line. This involves many steps, as different columns have different cleaning needs. Some columns with missing or strange values need to be addressed, and categorical variables need to be encoded before being used in models.
First, the Pregnant column registers many missing values. However, viewing the dataset makes it clear that every row with a missing value in this column is Male in the Sex column. This makes sense, as a biological male would be unable to get pregnant, so this data is unlikely to be missing at random. As a result, I replace all of the missing values in the column with a 0, indicating not pregnant. As the information for sex is already contained within another column, this solution should be suitable. It is worth noting that this assumes 0 means not pregnant and 1 means pregnant; the common convention is for 0 to be negative and 1 to be positive, but this is not explicitly stated within the dataset.
Next I dropped two columns from the dataset. The Participant column was dropped as it has no variation, the value is 1 for every sample within the dataset. The Concentration column was also dropped as the values were identical to the values in the Focus column, which means only one of the two columns needs to be kept.
Then I encoded 3 different columns: Delay, Sex, and Housing. Each of these columns is a categorical variable with only two categories, so I can simply recode them as 0s and 1s.
After that, the Race column also needed to be encoded, however this column has 4 different categories. Therefore one-hot encoding was necessary so dummy variables are created for each of the 4 possible Race categories. The Race column itself can then be dropped.
Then there were a series of columns with unusual values: Passive has many -999 values, Dep_Mood has many 0 values, and Tired has many inf values. There are multiple ways these values could be interpreted, and without domain knowledge of the dataset and how it has been recorded, it is impossible to know for certain. It is possible that these values are legitimate readings for the variables they represent. For example, if Dep_Mood is a particular symptom, a value of 0 could be a legitimate reading for someone who does not display that symptom; such values may carry information rather than being errors or values missing at random. Without being able to know for certain, I had to decide how to handle these values, and I made the decision to treat them as missing and replace them with NaN values in the dataset. This ensures that these unusual values will not impact model creation down the line. However, I acknowledge that these values may contain information, could appear again in new data, and perhaps should be treated differently if greater domain knowledge is acquired.
The code below processes all of these changes and outputs a cleaned dataset. As the data is explored and analysed, more data cleaning and transformations will need to be conducted but this provides a good starting point.
# Creates a copy of the data frame for the cleaning to take place
data_clean = data.copy()
# Fills the missing values for Pregnant with a 0
data_clean["Pregnant"] = data["Pregnant"].fillna(0)
# Drops the columns Partipant and Concentration
data_clean = data_clean.drop(["Participant", "Concentration"], axis=1)
# Encodes Delay as 1 and 0
data_clean['Delay'] = data_clean['Delay'].map({'Yes': 1, 'No': 0})
# Encodes Sex as 1 and 0
data_clean['Sex'] = data_clean['Sex'].map({'Male': 1, 'Female': 0})
# Encodes Housing as 1 and 0
data_clean['Housing'] = data_clean['Housing'].map({'Stable': 1, 'Unstable': 0})
# Converts Race into dummy variables
race_dummies = pd.get_dummies(data_clean.Race)
data_clean = pd.concat([data_clean, race_dummies], axis=1)
data_clean = data_clean.drop(["Race"], axis = 1)
# Replaces the -999 values in Passive with missing values
data_clean['Passive'] = data_clean['Passive'].replace(-999, np.nan)
# Replaces the 0 values in Dep_Mood with missing values
data_clean['Dep_Mood'] = data_clean['Dep_Mood'].replace(0, np.nan)
# Replaces the inf values in Tired with missing values
data_clean.replace([np.inf, -np.inf], np.nan, inplace=True)
The next step is to begin exploring the dataset, looking at correlations, trends, and assessing statistical attributes of different features. To begin this step, a correlation heatmap was plotted for all the features in the cleaned dataset.
Looking at a correlation heatmap can be useful for a few reasons. Firstly, it can help identify features that are highly correlated with the target feature, which aids feature selection when trying to decrease dimensionality and create a simpler model. Correlation heatmaps can also be used to detect multicollinearity, where two or more features are highly correlated with each other, which can cause problems in some models. Finally, a heatmap provides some insight into how each feature relates to the others.
Looking at this heatmap, some things immediately stand out. There are some features that correlate very highly with each other. Rumination and Intrusive_Thoughts are very highly correlated, as are Stress and Tension, and Unusual_Thought and Apathy. These correlations will be further assessed later on.
Furthermore, the most highly correlated features with the target feature, 'Diagnosis', are Rumination, Intrusive_Thoughts, Tension, Stress and Delusion. These features should also be looked at in more detail.
# Code for a correlation heatmap of the features in the dataset
plt.figure(figsize=(26, 12)) # Create a figure
heatmap = sns.heatmap(data_clean.corr(), vmin=-1, vmax=1, annot=True, cmap='PuOr') # plots a heatmap of the correlation coefficients
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':18}, pad=12) # figure formatting
The block of code below takes pairs of features that correlate highly with each other and calculates each feature's average correlation with all features excluding the target. This shows which feature in each pair correlates more with the rest of the dataset. Looking at the results, Rumination, Tension, and Unusual_Thought each have higher average correlations with the other features than their paired feature, which could make them good candidates to exclude during feature selection.
# Absolute correlations between all numeric features
corr_data = data.corr(numeric_only = True)
corr_data = corr_data.abs()
# Drop the target feature's row so it is excluded from the averages
corr_data = corr_data.drop("Diagnosis")
# Mean correlation of each feature in the highly correlated pairs
print(corr_data["Rumination"].mean(axis=0))
print(corr_data["Intrusive_Thoughts"].mean(axis=0))
print(corr_data["Tension"].mean(axis=0))
print(corr_data["Stress"].mean(axis=0))
print(corr_data["Apathy"].mean(axis=0))
print(corr_data["Unusual_Thought"].mean(axis=0))
0.2113599553473987
0.20385214755065645
0.20455901549271044
0.19226235650729623
0.16659931539136502
0.18455294730347846
The next step is to look in more detail at the features most correlated with the target. To do this, a box plot and a histogram have been plotted for the top 10 features most correlated with Diagnosis. This helps to see the distribution of the data, as well as how many outliers different features might have.
It is useful to see how normally distributed different features are. In this case, a lot of the features appear skewed, with Tension looking to have the highest skew; overall, though, most of the features appear reasonably normally distributed by eye. Normality will be assessed statistically later on to more formally analyse the distributions of each feature.
Looking at the boxplots, outliers seem fairly common in many of the features. This is something that will need to be considered, as there are many strategies and differing ideas around handling outliers. My conclusions regarding outliers will be discussed at the end of the EDA.
# Creates a boxplot and a histogram for each of the features most correlated with Diagnosis
top_corr_cols = ["Rumination", "Intrusive_Thoughts", "Tension", "Stress", "Suspicious",
                 "Delusion", "Anhedonia", "Focus", "Unusual_Thought", "Tired"]
fig, ax = plt.subplots(10, 2, figsize = (25, 20))
for i, col in enumerate(top_corr_cols):
    sns.boxplot(x = data_clean[col], ax = ax[i, 0])
    sns.histplot(data_clean[col], ax = ax[i, 1])
plt.tight_layout()
Next, a pairplot was plotted for the features most correlated with Diagnosis. A pairplot shows pairwise relationships between variables in a dataset. Kernel density estimates have been selected to show these relationships, and colour has been used to split the data by Diagnosis outcome. The graph shows some interesting results. For example, the strong correlations between Intrusive_Thoughts and Rumination, as well as Stress and Tension, are easily visualised by the linear shape of their respective plots. Furthermore, the diagonal provides good insight into the distribution of each feature split by Diagnosis outcome. For example, with Tension, people with a positive diagnosis occupy a much narrower distribution at a higher value, whilst people with a negative diagnosis occupy a broader distribution at a lower value. Some overlap is present in all of the KDE plots, but there is still often a clear difference in the space occupied by the two Diagnosis outcomes. Overall, a pairplot such as this helps us understand the shape of the data and the relationships between features.
# Code for creating a pairplot of the features most correlated with Diagnosis
cols = ["Diagnosis", "Rumination", 'Intrusive_Thoughts', "Tension", "Stress", "Suspicious", "Delusion", "Anhedonia", "Focus", "Unusual_Thought", "Tired"] # define the columns to plot
sns.pairplot(data_clean[cols], corner = True, hue = 'Diagnosis', palette="viridis", height = 2.5, kind='kde') # create a pairplot broken down by the outcome
plt.tight_layout() # adjusts the plot parameters for saving the figure
plt.show()
Normality and skewness can impact the accuracy and interpretability of machine learning models. Therefore, the next step was to formally assess the normality and skewness of all the features in the dataset. To do this I created a function that calculates both for an input dataset and outputs every column beyond a certain threshold. For normality this threshold is a p-value of less than 0.05 in a D'Agostino K-squared normality test (scipy's normaltest function), and for skewness it is a value outside the range -0.5 to 0.5. These are commonly used thresholds for these attributes.
The results show 9 features that exceed the threshold for either normality, skewness, or both, and those features will be looked at more closely.
# Function to calculate normality and skewness of each column
def calc_normality_skewness(df):
    for col in df.columns:
        # Calculate normality test and p-value
        normality_test_stat, normality_pval = normaltest(df[col])
        # Calculate skewness
        skewness = skew(df[col])
        # Check if p-value is less than 0.05 or skewness is outside the range of -0.5 to 0.5
        if normality_pval < 0.05 or abs(skewness) > 0.5:
            # Print column name
            print(f"Column '{col}':")
            # Print normality test statistic and p-value if p-value is less than 0.05
            if normality_pval < 0.05:
                print(f"  Normality test statistic: {normality_test_stat:.3f} (p-value: {normality_pval:.3f})")
            # Print skewness if it is outside the range of -0.5 to 0.5
            if abs(skewness) > 0.5:
                print(f"  Skewness: {skewness:.3f}")
# Call the function to calculate normality and skewness of each column in the dataframe
data_normal = data_clean[["Anhedonia", "Apathy", "Appetite", "Content", "Delusion", "Dep_Mood", "Focus",
"Hallucination", "Intrusive_Thoughts", "Passive", "Psychomotor", "Rumination", "Sleep",
"Stress", "Suspicious", "Tension", "Tired", "Unusual_Thought", "Withdrawal"]]
calc_normality_skewness(data_normal)
Column 'Apathy':
  Normality test statistic: 19.854 (p-value: 0.000)
Column 'Appetite':
  Normality test statistic: 580.275 (p-value: 0.000)
  Skewness: 0.853
Column 'Content':
  Normality test statistic: 8115.299 (p-value: 0.000)
  Skewness: 10.549
Column 'Delusion':
  Normality test statistic: 60.152 (p-value: 0.000)
Column 'Hallucination':
  Normality test statistic: 8560.731 (p-value: 0.000)
  Skewness: 11.736
Column 'Rumination':
  Normality test statistic: 209.569 (p-value: 0.000)
Column 'Stress':
  Normality test statistic: 108.289 (p-value: 0.000)
Column 'Tension':
  Normality test statistic: 215.220 (p-value: 0.000)
Column 'Unusual_Thought':
  Normality test statistic: 47.476 (p-value: 0.000)
Next, it is important to get a better understanding of the shape and distribution of these features. To achieve this, the 9 features that are not normally distributed or display skewness are plotted below with boxplots and histograms. For very large datasets such as this, approximate normality can often be assumed due to the central limit theorem; this is often relied on when performing hypothesis tests or confidence interval estimation. However, certain models such as linear regression still assume normality (of the residuals) regardless of sample size.
Within this work, linear regression is not being used, and the classification models that will be trained are robust to non-normal data. With this in mind, most of these features are likely within an acceptable level of normality. However, some of the unsupervised methods used later, such as PCA, can perform better when the data is approximately normally distributed. Therefore, extreme cases of non-normality and skewness should be dealt with. There are two features in particular that will be addressed: Content and Hallucination. Both have extremely high skewness and highly non-normal distributions, due to a small number of extremely high values.
# Creates a boxplot and a histogram for each feature flagged as non-normal or skewed
flagged_cols = ["Apathy", "Appetite", "Content", "Delusion", "Unusual_Thought",
                "Hallucination", "Rumination", "Stress", "Tension"]
fig, ax = plt.subplots(9, 2, figsize = (25, 20))
for i, col in enumerate(flagged_cols):
    sns.boxplot(x = data_clean[col], ax = ax[i, 0])
    sns.histplot(data_clean[col], ax = ax[i, 1])
plt.tight_layout()
In order to handle the two features with extremely high skew and non-normality, a Box-Cox transformation will be applied to each. Box-Cox is a technique for normalising non-normal distributions of strictly positive data. It applies a power transformation to the data points, controlled by a parameter lambda, and the optimal lambda is found as the value that best normalises the resulting distribution.
The below code handles the box-cox transformation of the Content feature and plots a new boxplot and histogram showing the new distribution of the data and retesting for normality. The results show that the box-cox transformation is successful in making the data normally distributed.
# perform Box-Cox transformation on the Content column
transformed, _ = boxcox(data_clean['Content'])
# add transformed data to DataFrame
data_clean['Content_boxcox'] = transformed
# check for normality using Shapiro-Wilk test
stat, p = shapiro(data_clean['Content_boxcox'])
# print results
print('Shapiro-Wilk Test')
print('Statistic:', stat)
print('P-Value:', p)
if p > 0.05:
    print('Data is normally distributed')
else:
    print('Data is not normally distributed')
# create subplots of boxplot and histogram
fig, axs = plt.subplots(ncols=2, figsize=(15, 5))
sns.boxplot(x=data_clean['Content_boxcox'], ax=axs[0])
sns.histplot(x=data_clean['Content_boxcox'], ax=axs[1])
# add labels to subplots
axs[0].set(title='Boxplot of Transformed Data', xlabel='Content_boxcox')
axs[1].set(title='Histogram of Transformed Data', xlabel='Content_boxcox')
# show plot
plt.show()
Shapiro-Wilk Test
Statistic: 0.9996200203895569
P-Value: 0.4694824814796448
Data is normally distributed
The same method is then applied to the Hallucination feature. The below code handles the box-cox transformation of the Hallucination feature and plots a new boxplot and histogram showing the new distribution of the data and retesting for normality. The results show that the box-cox transformation was also successful in making the data normally distributed.
# perform Box-Cox transformation on the Hallucination column
transformed, _ = boxcox(data_clean['Hallucination'])
# add transformed data to DataFrame
data_clean['Hallucination_boxcox'] = transformed
# check for normality using Shapiro-Wilk test
stat, p = shapiro(data_clean['Hallucination_boxcox'])
# print results
print('Shapiro-Wilk Test')
print('Statistic:', stat)
print('P-Value:', p)
if p > 0.05:
    print('Data is normally distributed')
else:
    print('Data is not normally distributed')
# create subplots of boxplot and histogram
fig, axs = plt.subplots(ncols=2, figsize=(15, 5))
sns.boxplot(x=data_clean['Hallucination_boxcox'], ax=axs[0])
sns.histplot(x=data_clean['Hallucination_boxcox'], ax=axs[1])
# add labels to subplots
axs[0].set(title='Boxplot of Transformed Data', xlabel='Hallucination_boxcox')
axs[1].set(title='Histogram of Transformed Data', xlabel='Hallucination_boxcox')
# show plot
plt.show()
Shapiro-Wilk Test
Statistic: 0.9994624853134155
P-Value: 0.1619586944580078
Data is normally distributed
As the transformed columns for Content and Hallucination have been added, the original columns can be dropped, creating a new dataset that includes the transformations, called data_trans.
# Drops the untransformed columns to create the transformed dataset
data_trans = data_clean.drop(["Content", "Hallucination"], axis=1)
Now that the data has been cleaned, examined, and transformed, the preprocessing and EDA stage is concluded. However, there are some obvious omissions in this section: missing values have not been imputed, the data has not been scaled, and outliers have not been treated. There are reasons for each of these.
Firstly, missing values will be handled directly when creating and training the machine learning models, in order to avoid data leakage. The method used in this work is KNN imputation: a technique that fills in each missing value using the values of the k nearest neighbouring rows under a distance metric (Euclidean by default). The missing value is imputed as the average of those neighbours' values. If this were done before the data is split for training and testing, the data in the test set would influence the imputation of the training set, which can lead to overfitting and over-optimistic evaluation.
The same problem applies to data scaling. The method used in this work is the StandardScaler, which scales the input features to zero mean and unit variance using z-score normalisation. As a result, both imputation and scaling will take place within model pipelines to avoid data leakage. This can be seen in the code for those sections, but will not be directly discussed; wherever the KNNImputer and StandardScaler appear, assume these steps are taking place.
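As a minimal sketch of this setup (the classifier and the neighbour count here are illustrative, not the report's final choices), imputation and scaling can be chained ahead of a model so that cross-validation fits both steps on the training folds only:

```python
# Hedged sketch: a pipeline where KNN imputation and standard scaling are
# fitted only on the training folds during cross-validation, so no information
# from the held-out fold leaks into preprocessing.
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ("impute", KNNImputer(n_neighbors=5)),      # fill each NaN from the 5 nearest rows
    ("scale", StandardScaler()),                # zero mean, unit variance
    ("model", LogisticRegression(max_iter=1000)),
])
# Usage: scores = cross_val_score(pipe, X, y, cv=5), where X may contain NaNs
```

Because the imputer and scaler live inside the pipeline, each cross-validation split refits them from scratch on that split's training data, which is exactly what prevents the leakage described above.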
Finally, whilst outliers have been considered, the decision has been made to leave them in the data. With the information provided, it is impossible to know whether they are genuine data points or measurement errors, as there is no information about how the data was collected or what each column means, and no literature indicating a usual range of values for each feature. Removing outliers, or treating them in some other way, might increase model accuracy, but if the values are genuine it would reduce the model's representativeness. This could harm the model's ability to generalise to new data, which may also include outliers. This is something that could be improved in future work, where a better method may exist.
The result of all of this EDA and preprocessing is a cleaned and transformed dataset. This dataset will now be used in the next two sections: Unsupervised Learning and Supervised Learning.
Unsupervised learning is a machine learning method that handles data with no labels or categories. The objective with these methods is usually to find patterns or structures in the data without any knowledge of the outcomes or particular groupings. The algorithms used in unsupervised learning are designed to identify relationships and reveal what potential groups and clusters might exist.
The dataset in this task is labelled through the diagnosis column. In many of the methods used in this section, that column will be temporarily removed to see how the unsupervised methods group the data, and the data will then be relabelled with the diagnosis to evaluate how the methods perform.
The methods performed in this section will be k means clustering, hierarchical clustering, tSNE, and PCA.
The first method that will be used is clustering, and in particular k-means clustering. Clustering is a method where data points are grouped based on similarities in their features. This can be useful for identifying patterns and shapes in the data.
In k-means clustering, the number of clusters is set as the value k, and the algorithm groups the data points into those k clusters by minimizing the distance between each data point and the centroid of the cluster it belongs to. The centroid is the center point of each cluster, which is calculated as the average of all the data points in the cluster.
The k-means algorithm starts by randomly selecting k data points as the initial centroids for the clusters. Each data point is then assigned to the nearest centroid, and the centroid is updated by taking the average of all the data points in that cluster. This process is repeated until the centroids no longer change, or until a maximum number of iterations is reached.
The code below runs a k-means algorithm on the transformed dataset. It then provides two values and a graph to assess the performance. The silhouette score is a metric used to evaluate the quality of clustering results. It measures how well each data point fits into its assigned cluster, as well as how well it is separated from other clusters. The silhouette score ranges from -1 to 1, where a score of 1 indicates perfect clustering, -1 indicates incorrect clustering, and 0 indicates overlapping clusters. In this instance, the silhouette score is 0.12, which indicates weak cluster separation, with heavily overlapping clusters. It is worth noting that a low silhouette score does not necessarily indicate a mistake; it can provide insight into the shape of the data and how readily it can be clustered. A k value of 2 was chosen to mirror the 2 groupings of the Diagnosis column; other k values were also tested, and k = 2 had the highest silhouette score regardless.
With that in mind, the next score used is the adjusted Rand index. This measures the similarity between two sets of cluster assignments. In this case, the results of the k-means clustering are compared directly to the actual groupings of the Diagnosis column. This score ranges from -1 to 1, with 1 being perfect agreement. The k-means clustering had an adjusted Rand index of 0.86 with the Diagnosis label, which indicates that the clustering was very similar to the Diagnosis grouping. This suggests that the Diagnosis label is a very natural way of grouping the individuals within the dataset.
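As a quick toy illustration of how the adjusted Rand index behaves (made-up labels, not the report's data): it is invariant to relabelling of the clusters, and it falls between 0 and 1 for partial agreement.

```python
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 0, 1, 1, 1]

# The same grouping with the cluster names swapped still scores a perfect 1.0,
# because the index compares groupings, not label names.
perm_score = adjusted_rand_score(true_labels, [1, 1, 1, 0, 0, 0])

# A clustering that only partially matches scores somewhere between 0 and 1.
partial_score = adjusted_rand_score(true_labels, [0, 0, 1, 1, 1, 1])

print(perm_score, partial_score)
```

This label-permutation invariance is what makes the index suitable for comparing k-means output, whose cluster numbering is arbitrary, against the Diagnosis column.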
This can be visualised in the graph, where Rumination and Tension were chosen as two random features. The colour represents the k-means clusters, and the shape represents the Diagnosis label. You can clearly see how there is strong overlap between the two clusters, but still two distinct shapes. You can also see how the majority of each colour is only one shape, representing the high similarity between the clusters and the Diagnosis label.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Separate the known target variable from the features before clustering
target = data_trans["Diagnosis"]
features = data_trans.drop("Diagnosis", axis=1)
# Impute missing values
imputer = KNNImputer(n_neighbors=5)
df = pd.DataFrame(imputer.fit_transform(features), columns=features.columns)
# Scale the data using StandardScaler
scaler = StandardScaler()
df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
# Perform k-means clustering (n_init set explicitly to suppress the FutureWarning)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(df)
# Get the predicted cluster labels
y_pred = kmeans.predict(df)
# Calculate the silhouette score
silhouette_avg = silhouette_score(df, y_pred)
# Print the silhouette score
print('Silhouette Score: {:.2f}'.format(silhouette_avg))
# Compute the adjusted Rand index against the Diagnosis labels
ari = adjusted_rand_score(target, kmeans.labels_)
# Add cluster labels and target variable to the dataframe for plotting
df['cluster'] = kmeans.labels_
df['target'] = target
# Plot the clusters and target variable using plotly
fig = px.scatter(df, x='Rumination', y='Tension', color='cluster', symbol='target', color_continuous_scale="Portland")
fig.show()
print('Adjusted Rand index:', ari)
Silhouette Score: 0.12
Adjusted Rand index: 0.8604138428358107
The next clustering method to be used was agglomerative clustering, which is a form of hierarchical clustering. This method involves dividing a dataset into nested clusters, which are organized into a tree-like structure known as a dendrogram. Agglomerative clustering starts with each data point as a separate cluster and then iteratively merges the two closest clusters into a larger cluster until all data points belong to a single cluster. This can reveal the structure of the data at multiple levels of granularity, allowing us to explore the data in a more flexible way.
The code below runs agglomerative clustering on the transformed dataset. The adjusted Rand index is not taken for this method as the number of clusters is not predetermined. Instead of comparing the results to the Diagnosis label, this method will be used to view the dataset at a more granular level, to see if it provides any further insight. There are many linkage methods for agglomerative clustering; the code below shows the silhouette score for average linkage, which was the best performing with a silhouette score of 0.71.
from sklearn.pipeline import make_pipeline
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
# Split the data into feature and target dataframes
X = data_trans.drop('Diagnosis', axis=1)
# Impute missing values with a KNN imputer
imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(X)
# Create a pipeline with a StandardScaler and an AgglomerativeClustering estimator
pipeline = make_pipeline(StandardScaler(), AgglomerativeClustering(linkage="average", metric="euclidean"))
# Fit the pipeline to the data and get the predicted cluster labels
y_pred = pipeline.fit_predict(X_imputed)
# Calculate the silhouette score on the imputed features
silhouette_avg = silhouette_score(X_imputed, y_pred)
# Print the silhouette score
print('Silhouette Score: {:.2f}'.format(silhouette_avg))
Silhouette Score: 0.71
The code below now plots two dendrograms for different linkage methods of agglomerative clustering. These linkage methods are the ward and average methods. Dendrograms show the hierarchical structure of the clustering results. Each level of the dendrogram represents a different clustering solution, with the top of the dendrogram representing a single cluster that contains all the data points, and the bottom level consisting of many small clusters, each containing only a few data points.
The resulting dendrograms look messy, but potentially provide interesting results. Firstly, the dendrogram on the right, for average linkage, merges clusters at very large distances, which explains the high silhouette score, as each cluster is clearly quite distinct. The dendrogram on the left, for ward linkage, shows that the optimal number of clusters might be quite high, certainly higher than the 2 clusters used in k-means clustering.
Taking the two different clustering methods into account, it is possible that the dataset is quite difficult to cluster. One reason for this could be leaving in outliers. Outliers in the data can often form their own clusters, which may explain the high number of clusters in the below dendrograms. Even messy graphs can provide insights and understanding into the shape of the data and how the samples are distributed.
from scipy.cluster.hierarchy import linkage, dendrogram
# Split the data into feature and target dataframes
X = data_trans.drop('Diagnosis', axis=1)
# Impute missing values with a KNN imputer
imputer = KNNImputer(n_neighbors=1)
X_imputed = imputer.fit_transform(X)
# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)
# Define the linkage methods to use
linkage_methods = ['ward', 'average']
# Create two subplots, one for each linkage method
fig, axs = plt.subplots(1, 2, figsize=(16, 8))
# Iterate over the linkage methods and plot the corresponding dendrogram in each subplot
for i, method in enumerate(linkage_methods):
    Z = linkage(X_scaled, method=method)
    axs[i].set_title(f'Hierarchical Clustering Dendrogram ({method} linkage)')
    axs[i].set_xlabel('Sample Index')
    axs[i].set_ylabel('Distance')
    dendrogram(Z, ax=axs[i], leaf_rotation=90., leaf_font_size=8.)
for subplot in axs:
    subplot.set_xticklabels([])
The next unsupervised method implemented was t-SNE. t-SNE, or t-distributed stochastic neighbour embedding, is a nonlinear dimensionality reduction technique that can be used to visualise high-dimensional data in two or three dimensions. It can be useful for visualising clusters, patterns, and trends that may not be visible, or easy to visualise, in the original high-dimensional space.
The t-SNE algorithm calculates the pairwise similarities between the data points in the high-dimensional space using a Gaussian distribution. It then maps these similarities to a lower-dimensional space by minimizing the divergence between the pairwise similarities of the high-dimensional and low-dimensional representations of the data points.
The code block below uses the t-SNE algorithm on the transformed dataset to produce a scatterplot. The scatterplot represents the dataset in a two-dimensional space, where the position on the plot represents its similarity to the other data points, with closer points being more similar. The data was then coloured by the Diagnosis label to see how the overall shape of the dataset can be visualised. Perplexity is a parameter in t-SNE that can greatly impact what these visualisations look like, therefore I have plotted 6 graphs for 6 different values of perplexity to see how this shape transforms.
The subplots provide some very interesting insights. Firstly, for all values of perplexity, clear separation between both Diagnosis labels can be seen, with each side almost existing on different halves of a circle-like shape. Interestingly, at perplexity 50, this shape begins to change, with the two different Diagnosis labels shifting to switch sides, which provides an interesting visual, before at higher perplexity levels, the sides have been fully swapped. In future work it could be interesting to map this in 3D space and animate the scatterplot for ascending values of perplexity to see how the shape warps.
From the t-SNE and the clustering, it seems as though the Diagnosis labels do exist in clearly recognisable clusters, but that these clusters have a large amount of overlap. This could be due to a particular threshold in Diagnosis. People are either diagnosed or not, but in reality people are more likely to exist on a spectrum, with some people only just meeting the criteria for Diagnosis, and some people only just missing the criteria for Diagnosis. This can't be proven with the information provided, but it would provide a logical explanation as to why such overlap exists, and why creating distinct clusters may be difficult.
from sklearn.manifold import TSNE
# Impute missing values with a KNN imputer
imputer = KNNImputer(n_neighbors=5)
data_imputed = imputer.fit_transform(data_trans)
# Separate the Diagnosis labels (the first column) from the features
labels = data_imputed[:, 0]
data = data_imputed[:, 1:]
# Define a range of perplexity values to iterate over
perplexities = [10, 20, 30, 40, 50, 100]
# Set up the subplot grid
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(14, 8))
# Iterate over the perplexity values and the subplots
for i, perplexity in enumerate(perplexities):
    row = i // 3
    col = i % 3
    # Initialise a t-SNE model with the current perplexity
    tsne = TSNE(perplexity=perplexity, learning_rate=300, random_state=10)
    # Fit the model to the data
    embedding = tsne.fit_transform(data)
    # Combine the embedded coordinates with the labels
    embedded_data = np.column_stack((embedding, labels))
    # Convert the NumPy array to a Pandas DataFrame
    embedded_df = pd.DataFrame(embedded_data, columns=['x', 'y', 'label'])
    # Plot the embedding on the current subplot
    sns.scatterplot(x='x', y='y', data=embedded_df, palette='bright', alpha=0.7, hue='label', ax=axes[row, col])
    axes[row, col].set_title(f"Perplexity = {perplexity}")
# Show the plot
plt.tight_layout()
plt.show()
The final method of unsupervised learning that will be used is PCA. Principal component analysis is a technique used for dimensionality reduction by projecting a dataset onto a lower-dimensional space. It identifies the patterns and trends in the data and transforms the data into a new coordinate system that represents these as orthogonal axes called principal components.
The output of PCA is these principal components, which are a set of transformed features that capture the most important patterns in the data. The first principal component represents the direction of maximum variability in the data, while the subsequent components capture progressively smaller amounts of variability.
The code below performs PCA on the transformed dataset and outputs two graphs and a range of values. The first graph is a 3D plot of the first 3 principal components that are created, and then coloured based on the Diagnosis label. As seen in clustering and t-SNE, there are two clear but overlapping regions for each Diagnosis outcome.
The range of values represents the explained variance ratio. This value shows how much of the original dataset's variance is explained by each principal component. This is then plotted cumulatively in the graph below it, which shows that almost all of the principal components are needed to explain the variance in the dataset, with the first 3 components accounting for only around 30% of the variance.
These results in combination suggest that only a small percentage of the variance of the dataset is needed to visualise the distinct shapes of each Diagnosis group. This is interesting information, as the PCA results will later be tested on the best performing supervised learning model. This suggests that even the first 3 principal components may perform reasonably well, despite accounting for such a small percentage of the variance. However, that is yet to be seen.
from sklearn.decomposition import PCA
# Separate the features from the target variable
X = data_trans.drop('Diagnosis', axis=1)
y = data_trans['Diagnosis']
# Impute missing values
imputer = KNNImputer(n_neighbors=5)
X = imputer.fit_transform(X)
# Scale the data
scaler = StandardScaler()
X = scaler.fit_transform(X)
# Perform PCA on the data
pca = PCA()
pca.fit(X)
X_pca = pca.transform(X)
# Create a 3D plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# Plot the data points
ax.scatter(X_pca[:,0], X_pca[:,1], X_pca[:,2], c=y)
# Set the labels for the three axes
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
# Show the plot
plt.show()
# Create a new DataFrame with the principal components
df_pca = pd.DataFrame(X_pca)
# Convert the explained variance ratio to percentage
explained_variance_ratio_pct = [round(ratio*100, 2) for ratio in pca.explained_variance_ratio_]
# Print the explained variance ratio as percentage
print(f"Explained variance ratio: {explained_variance_ratio_pct}")
cumulative_sum = [sum(explained_variance_ratio_pct[:i+1]) for i in range(len(explained_variance_ratio_pct))]
# Plot the cumulative sum of the explained variance ratio
plt.plot(range(1, len(cumulative_sum)+1), cumulative_sum, marker='o')
plt.xlabel('Number of components')
plt.ylabel('Cumulative sum of explained variance ratio (%)')
plt.show()
Explained variance ratio: [14.86, 8.41, 6.05, 5.77, 5.34, 4.73, 4.45, 4.35, 3.85, 3.82, 3.77, 3.63, 3.58, 3.45, 3.4, 3.29, 3.25, 3.23, 3.11, 2.96, 1.79, 0.81, 0.69, 0.66, 0.4, 0.36, 0.0]
To quickly test what was discussed in the PCA section, the clustering techniques have been revisited using only the first 2 components of the PCA. As can be seen below, for k-means clustering, using only 2 principal components actually greatly improves the silhouette score, but greatly reduces the adjusted Rand index. This makes sense: the PCA has helped identify more distinct clusters, but the limited variance captured by the components means the direct comparison to the Diagnosis labels performs more poorly.
# Take the first 2 principal components as features
X = df_pca.iloc[:, :2]
target = data_trans["Diagnosis"]  # known target variable
# Perform k-means clustering (n_init set explicitly to suppress the FutureWarning)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
# Get the predicted cluster labels
y_pred = kmeans.predict(X)
# Calculate the silhouette score
silhouette_avg = silhouette_score(X, y_pred)
# Print the silhouette score
print('Silhouette Score: {:.2f}'.format(silhouette_avg))
# Compute the adjusted Rand index
ari = adjusted_rand_score(target, kmeans.labels_)
print('Adjusted Rand index:', ari)
Silhouette Score: 0.61
Adjusted Rand index: 0.6303620865465726
Using agglomerative clustering with only 2 PCA components produces much cleaner looking dendrograms. In this case, the ward linkage suggests an optimal number of 2 clusters, which would be expected with the 2 possible Diagnosis labels, while the average linkage suggests either 2 or 3 clusters, perhaps due to the influence of outliers. Overall, there are not many specific conclusions that can be drawn from this. It seems as though PCA can improve clustering performance in general; however, due to the high number of components needed to capture a good percentage of the variance, the reliability and interpretability of the results may be reduced. Future work could look at other dimensionality reduction techniques to see if this can be improved upon.
# Take the first 2 principal components as features
X = df_pca.iloc[:, :2]
# The principal components were computed from imputed data, so no further imputation is needed
# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Define the linkage methods to use
linkage_methods = ['ward', 'average']
# Create two subplots, one for each linkage method
fig, axs = plt.subplots(1, 2, figsize=(16, 8))
# Iterate over the linkage methods and plot the corresponding dendrogram in each subplot
for i, method in enumerate(linkage_methods):
    Z = linkage(X_scaled, method=method)
    axs[i].set_title(f'Hierarchical Clustering Dendrogram ({method} linkage)')
    axs[i].set_xlabel('Sample Index')
    axs[i].set_ylabel('Distance')
    dendrogram(Z, ax=axs[i], leaf_rotation=90., leaf_font_size=8.)
for subplot in axs:
    subplot.set_xticklabels([])
Overall, the unsupervised learning methods provided many good insights. It seems as though each Diagnosis outcome does take its own distinct place within the data, but that there is a lot of overlap between the two groups. It also shows that the data is highly complex, and many features are needed to fully capture the information within the dataset. More work can definitely be done in this area, to look at different techniques and also to attempt more of these techniques in combination with supervised learning methods to see if it can improve model performance.
The next step is using supervised learning models to predict diagnosis. This is a binary classification problem, where 0 represents a control and 1 represents a disease case, and three different machine learning models have been selected to be trained. These models are logistic regression, random forest, and XGBoost models.
Each model will be trained and assessed on accuracy, F1 score, and AUC-ROC. It is worth noting that each model was tuned according to the AUC-ROC, to achieve the highest possible value for that metric. The optimisations for the AUC-ROC do not necessarily match the optimisations for the other performance metrics: for example, some model tunings produced greater accuracy than reported here, of around 95%, but the AUC-ROC for those models was lower. The AUC-ROC was chosen as it is a more robust performance metric for binary classifiers than accuracy.
As data imputation and data scaling were not handled within the EDA, this will need to be conducted within the pipelines for each machine learning model where appropriate. This is in order to avoid data leakage. However, XGBoost can handle missing values and non-scaled data itself.
Each model will also use cross validation to ensure that the results are reliable.
The performance metrics of each model will be compared to select the best model.
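As a toy illustration of why AUC-ROC was preferred (hypothetical scores, not the report's models): accuracy depends on a single 0.5 threshold, while AUC-ROC evaluates the full ranking of predicted probabilities.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

y_true  = [0, 0, 0, 1, 1, 1]
y_proba = [0.1, 0.3, 0.52, 0.55, 0.7, 0.9]   # hypothetical predicted probabilities
y_pred  = [int(p >= 0.5) for p in y_proba]   # hard labels at a 0.5 threshold

acc = accuracy_score(y_true, y_pred)   # 0.833...: one negative crosses the threshold
auc = roc_auc_score(y_true, y_proba)   # 1.0: every positive is ranked above every negative
print(acc, auc)
```

Here the ranking is perfect (AUC-ROC = 1.0) even though the thresholded accuracy is only about 0.83, which is why threshold-free ranking quality is a more robust basis for tuning.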
The code below imports all the additional tools needed that are not already imported:
The first classification model to be trained was a logistic regression classifier. This models the relationship between the dependent binary target variable and the independent variables in the dataset by estimating the probability of the dependent variable being either 0 or 1, based on the values of the independent variables.
The classifier works by calculating the weighted sum of the input features, that is then passed through a sigmoid function, which maps the continuous output of the weighted sum to a probability value between 0 and 1.
The output of the logistic regression classifier can be interpreted as the probability that the input belongs to the positive class, and if the probability is above a certain threshold the input is classified as positive, otherwise it is classified as negative. This can be used to make predictions.
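The weighted-sum-through-sigmoid step described above can be sketched directly (a toy illustration with made-up weights, not the fitted model from this report):

```python
import numpy as np

def sigmoid(z):
    """Map the weighted sum onto a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -0.5])   # hypothetical learned weights
b = 0.1                     # hypothetical intercept
x = np.array([1.2, 0.4])    # one input sample

p = sigmoid(w @ x + b)      # probability of the positive class (~0.70 here)
pred = int(p >= 0.5)        # classify as positive if above the 0.5 threshold
print(p, pred)
```

Training amounts to finding the weights `w` and intercept `b` that make these probabilities match the observed labels as closely as possible.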
The below code creates and trains the logistic regression classifier. This classifier has two parameters that need to be tuned to achieve better performance: C is a regularisation parameter that controls the strength of the regularisation applied to the model during training, and penalty specifies the type of regularisation to be used. As there are only two parameters with quite a limited number of possible values, the model was tuned manually by selecting different values and noting the performance. Through this method, a well performing model was created.
The model had an accuracy score of 0.934, an F1 score of 0.93 and an AUC-ROC of 0.982.
# Separate the features (X) and target variable (y) from the dataframe
X = data_trans.drop('Diagnosis', axis=1)
y = data_trans['Diagnosis']
pipeline = Pipeline([
('imputer', KNNImputer(n_neighbors=5)),
('scaler', StandardScaler()),
('lr', LogisticRegression(solver='liblinear', random_state=42, penalty='l2', C=0.44))])
# Use cross-validation to evaluate the performance of the pipeline
scores = cross_val_score(pipeline, X, y, cv=5)
# Print the accuracy score
print('Accuracy:', np.mean(scores))
# Get predicted probabilities for computing AUC
y_proba = cross_val_predict(pipeline, X, y, cv=5, method='predict_proba')[:, 1]
auc = roc_auc_score(y, y_proba)
print('AUC:', auc)
# Get predicted labels for computing classification metrics
y_pred = cross_val_predict(pipeline, X, y, cv=5)
# Print classification report with precision, recall, F1-score
print('Classification Report:\n', classification_report(y, y_pred))
Accuracy: 0.9343999999999999
AUC: 0.9817799483098446
Classification Report:
precision recall f1-score support
0 0.93 0.93 0.93 2471
1 0.94 0.93 0.94 2529
accuracy 0.93 5000
macro avg 0.93 0.93 0.93 5000
weighted avg 0.93 0.93 0.93 5000
The second machine learning model to be trained is a random forest classifier. A random forest classifier is an ensemble learning method that combines multiple decision trees to create a more robust and accurate model. It works by creating a large number of decision trees, each of which makes a prediction based on a subset of the features in the dataset.
The algorithm selects a subset of the training data and a subset of the features for each decision tree. This process, known as bagging, helps to reduce overfitting and improve the generalization performance of the model. Once the decision trees are created, they are combined to make a final prediction. In the case of binary classification, the final prediction is made by a majority vote of the individual decision trees. The class with the most votes is selected as the final output of the model.
The code below creates and trains the random forest classifier. First, the data is imputed using k-nearest neighbours imputation, and then scaled with the StandardScaler. Cross validation is then used to assess model performance. Random forest has a range of parameters that can be tuned to increase model performance. In order to tune the model, a grid search is conducted. A grid search runs through a range of possible parameters to determine the best performing parameter combination. This can be computationally expensive, so the next code block that performs the grid search can potentially be skipped. The best performing parameter values have been taken and used to train an optimised random forest model in the code block after, so moving straight to that block is possible.
# Separate the features (X) and target variable (y) from the dataframe
X = data_trans.drop('Diagnosis', axis=1)
y = data_trans['Diagnosis']
# Define the pipeline with steps for scaling, imputation, and the classifier
pipeline = Pipeline([
('imputer', KNNImputer(n_neighbors=5)),
('scaler', StandardScaler()),
('rf', RandomForestClassifier(random_state=42))
])
# Define the parameter grid to search over
param_grid = {
'rf__n_estimators': [50, 100, 200],
'rf__max_depth': [10, 20, 30, None],
'rf__min_samples_split': [2, 5, 10],
'rf__min_samples_leaf': [1, 2, 4],
'rf__max_features': ['sqrt', 'log2']
}
# Use GridSearchCV to find the best combination of hyperparameters
grid_search = GridSearchCV(pipeline, param_grid=param_grid, scoring='roc_auc', cv=5, n_jobs=-1)
grid_search.fit(X, y)
# Print the best combination of parameters and the corresponding accuracy score
print('Best parameters:', grid_search.best_params_)
print('Best score:', grid_search.best_score_)
Best parameters: {'rf__max_depth': 20, 'rf__max_features': 'log2', 'rf__min_samples_leaf': 2, 'rf__min_samples_split': 5, 'rf__n_estimators': 200}
Best score: 0.9821119475087802
This code block shows the optimised random forest classifier. The optimal parameter values from the grid search can be seen in the random forest classifier within the pipeline. This model performs reasonably well, with an accuracy of 0.93, an F1 score of 0.93, and an AUC-ROC of 0.98, each to 2 decimal places.
# Separate the features (X) and target variable (y) from the dataframe
X = data_trans.drop('Diagnosis', axis=1)
y = data_trans['Diagnosis']
# Define the pipeline with steps for scaling, imputation, and the classifier
pipeline = Pipeline([
('imputer', KNNImputer(n_neighbors=5)),
('scaler', StandardScaler()),
('rf', RandomForestClassifier(n_estimators=200, max_depth=20, max_features='log2', min_samples_leaf=2, min_samples_split=5))
])
# Use cross-validation to evaluate the performance of the pipeline
scores = cross_val_score(pipeline, X, y, cv=5)
# Print the accuracy score
print('Accuracy:', np.mean(scores))
# Get predicted probabilities for computing AUC
y_proba = cross_val_predict(pipeline, X, y, cv=5, method='predict_proba')[:, 1]
auc = roc_auc_score(y, y_proba)
print('AUC:', auc)
# Get predicted labels for computing classification metrics
y_pred = cross_val_predict(pipeline, X, y, cv=5)
# Print classification report with precision, recall, F1-score
print('Classification Report:\n', classification_report(y, y_pred))
Accuracy: 0.9311999999999999
AUC: 0.9811978699853854
Classification Report:
precision recall f1-score support
0 0.92 0.94 0.93 2471
1 0.94 0.92 0.93 2529
accuracy 0.93 5000
macro avg 0.93 0.93 0.93 5000
weighted avg 0.93 0.93 0.93 5000
The final model to be trained is an XGBoost model. XGBoost is a gradient boosting method that iteratively adds decision trees to the model, with each new tree correcting the errors of the previous trees. XGBoost shares similarities with random forest classifiers in that they both create an ensemble of decision trees to make a final prediction. However, there are many differences between the two algorithms such as XGBoost using a gradient-based optimization approach instead of bagging.
The XGBoost algorithm works by defining a loss function that measures the difference between the predicted outputs and the true outputs of the training data. The algorithm then iteratively adds decision trees to the model, with each new tree minimizing the loss function by correcting the errors of the previous trees. To prevent overfitting, XGBoost uses a technique called regularization, and XGBoost is also capable of handling missing values and non-scaled data.
It is often one of the most powerful and best performing models; however, to achieve top performance there are many parameters that should be tuned. In order to tune the hyperparameters, I used a Python package called Optuna, as a grid search was too computationally expensive for the large range of parameters I wanted to tune. Optuna is a useful optimisation tool, but it still takes a long time. As a result, I ran the optimisation separately to this workbook, and have extracted the parameter values that performed the best. The full range of values that were tested can be seen in the code block below; however, I have removed the Optuna code surrounding it, for simplicity. One main difference between Optuna and grid search is that whilst grid search runs through every possible combination, Optuna uses a Bayesian optimisation strategy that builds a probabilistic model of the objective function based on past evaluations and uses it to suggest new sets of hyperparameters to evaluate. This allows Optuna to search the hyperparameter space more efficiently and effectively than grid search.
""" param = {
'n_estimators': trial.suggest_int('n_estimators', 50, 1000),
'max_depth': trial.suggest_int('max_depth', 3, 15),
'learning_rate': trial.suggest_loguniform('learning_rate', 0.001, 0.5),
'subsample': trial.suggest_discrete_uniform('subsample', 0.5, 1.0, 0.05),
'colsample_bytree': trial.suggest_discrete_uniform('colsample_bytree', 0.5, 1.0, 0.05),
'gamma': trial.suggest_loguniform('gamma', 1e-8, 1.0),
'min_child_weight': trial.suggest_int('min_child_weight', 1, 20),
'reg_alpha': trial.suggest_loguniform('reg_alpha', 1e-8, 100),
'reg_lambda': trial.suggest_loguniform('reg_lambda', 1e-8, 100),
'scale_pos_weight': trial.suggest_uniform('scale_pos_weight', 0, 10),
}
"""
" param = {\n 'n_estimators': trial.suggest_int('n_estimators', 50, 1000),\n 'max_depth': trial.suggest_int('max_depth', 3, 15),\n 'learning_rate': trial.suggest_loguniform('learning_rate', 0.001, 0.5),\n 'subsample': trial.suggest_discrete_uniform('subsample', 0.5, 1.0, 0.05),\n 'colsample_bytree': trial.suggest_discrete_uniform('colsample_bytree', 0.5, 1.0, 0.05),\n 'gamma': trial.suggest_loguniform('gamma', 1e-8, 1.0),\n 'min_child_weight': trial.suggest_int('min_child_weight', 1, 20),\n 'reg_alpha': trial.suggest_loguniform('reg_alpha', 1e-8, 100),\n 'reg_lambda': trial.suggest_loguniform('reg_lambda', 1e-8, 100),\n 'scale_pos_weight': trial.suggest_uniform('scale_pos_weight', 0, 10),\n } \n"
The following code block creates and trains the optimised XGBoost classifier. The model is cross-validated with a stratified K-fold method before the average performance metrics are taken. Instead of iterating through the range of performance metrics, I simply ran the model three times, once for each desired metric, and noted the result. This is the best performing model so far, with an accuracy of 0.934, an F1 score of 0.937 and an AUC-ROC of 0.987.
# Additional imports for this section
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold, cross_val_score
# Split the data into features and target
X = data_trans.drop('Diagnosis', axis=1)
y = data_trans['Diagnosis']
# Define the XGBoost model and its hyperparameters
model = xgb.XGBClassifier(
max_depth=4,
learning_rate=0.02979300275648818,
subsample= 1.0,
n_estimators=579,
colsample_bytree= 0.55,
gamma= 0.005917904874804278,
min_child_weight= 10,
reg_alpha= 4.553986093186758e-08,
reg_lambda= 2.635622296775393e-07,
scale_pos_weight= 4.37625191889474,
objective='binary:logistic',
seed=42
)
# Define the cross-validation strategy
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Evaluate the XGBoost model using cross-validation
scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
# Print the cross-validation scores and their mean and standard deviation
print('Cross-validation scores:', scores)
print(f'Mean AUC-ROC: {scores.mean():.3f}')
print(f'Standard deviation: {scores.std():.3f}')
Cross-validation scores: [0.99229489 0.98323759 0.9853899  0.98429774 0.98853085]
Mean AUC-ROC: 0.987
Standard deviation: 0.003
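As a quick sanity check, the printed mean and standard deviation can be reproduced directly from the listed fold scores. Note that numpy's `std()` defaults to the population standard deviation, which `statistics.pstdev` matches:

```python
from statistics import mean, pstdev

# Reproduce the printed summary statistics from the five fold scores above
scores = [0.99229489, 0.98323759, 0.9853899, 0.98429774, 0.98853085]
print(f'Mean AUC-ROC: {mean(scores):.3f}')          # 0.987
print(f'Standard deviation: {pstdev(scores):.3f}')  # 0.003
```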
Now that the best performing of the three models has been identified, I wanted to see whether using the components created in the earlier PCA could improve performance. Based on the explained variance ratio, it seemed as if all, or almost all, of the components would be needed to achieve good performance, but it was still worthwhile to test. To do this, I trained the same XGBoost model as before on the entire PCA dataset. The results showed an improvement across all three performance metrics: an accuracy of 0.939, an F1 score of 0.942 and an AUC-ROC of 0.988. Although the increase in performance is small, it is still a noticeable difference, suggesting that applying PCA before XGBoost could be worthwhile. I also tested the model with only the first three principal components, and it still performed relatively well, with an AUC-ROC of 0.974, confirming my earlier suspicion from the PCA results that the data can be reasonably well separated with only a few components.
This is something that could be tested further in future work, by improving on the PCA method or trying other types of dimensionality reduction to increase performance. The XGBoost model could also be retuned for the PCA dataset to see if the optimal parameters are different for this dataset. For now though, the XGBoost model with the PCA dataset was the best performing model in this workbook.
# Split the data into features and target
X = df_pca
y = data_trans['Diagnosis']
# Define the XGBoost model and its hyperparameters
model = xgb.XGBClassifier(
max_depth=4,
learning_rate=0.02979300275648818,
subsample= 1.0,
n_estimators=579,
colsample_bytree= 0.55,
gamma= 0.005917904874804278,
min_child_weight= 10,
reg_alpha= 4.553986093186758e-08,
reg_lambda= 2.635622296775393e-07,
scale_pos_weight= 4.37625191889474,
objective='binary:logistic',
seed=42
)
# Define the cross-validation strategy
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Evaluate the XGBoost model using cross-validation
scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
# Print the cross-validation scores and their mean and standard deviation
print('Cross-validation scores:', scores)
print(f'Mean AUC-ROC: {scores.mean():.3f}')
print(f'Standard deviation: {scores.std():.3f}')
Cross-validation scores: [0.99229889 0.98618201 0.98485782 0.98466179 0.99043304]
Mean AUC-ROC: 0.988
Standard deviation: 0.003
The following table brings together the performance scores for each model. These results are hard-coded into the table.
The table shows that all four models have high accuracy, with values ranging from 0.928 to 0.939, high F1 scores, with values ranging from 0.93 to 0.942, and high AUC-ROC values, ranging from 0.981 to 0.988. All four of the models perform reasonably well across all metrics. Overall, the random forest model was the worst performing model across all three performance metrics, followed by logistic regression, then XGBoost, with XGBoost-PCA being the best performing model for every metric.
# Import tabulate for displaying the results table
from tabulate import tabulate
#create data
data = [["Logistic Regression",0.934 ,0.93 , 0.982],
["Random Forest", 0.928, 0.93, 0.981],
["XGBoost", 0.934, 0.937, 0.987],
["XGBoost-PCA", 0.939, 0.942, 0.988]]
#define header names
col_names = ["Model", "Accuracy", "F1 Score", "AUC-ROC"]
#display table
print(tabulate(data, headers=col_names, tablefmt="fancy_grid"))
╒═════════════════════╤════════════╤════════════╤═══════════╕
│ Model               │   Accuracy │   F1 Score │   AUC-ROC │
╞═════════════════════╪════════════╪════════════╪═══════════╡
│ Logistic Regression │      0.934 │      0.93  │     0.982 │
├─────────────────────┼────────────┼────────────┼───────────┤
│ Random Forest       │      0.928 │      0.93  │     0.981 │
├─────────────────────┼────────────┼────────────┼───────────┤
│ XGBoost             │      0.934 │      0.937 │     0.987 │
├─────────────────────┼────────────┼────────────┼───────────┤
│ XGBoost-PCA         │      0.939 │      0.942 │     0.988 │
╘═════════════════════╧════════════╧════════════╧═══════════╛
Model explainability is an important aspect of machine learning that involves the ability to interpret and understand how a model makes its decisions. This provides transparency and accountability, and can help build trust in the model's output. Explainability can also help identify and mitigate bias in the model, which is crucial for ensuring fairness and preventing discrimination. Finally, explainability can help us gain insights into the underlying patterns and relationships in the data, improving our understanding of the dataset and the features within it.
XGBoost models, whilst powerful, can often be difficult to interpret, and to some extent can be considered a "black box" model. However, there are tools available to better understand and interpret a model's results. This section will first take a closer look at the performance of the XGBoost model, and will then use SHAP values to help explain its output.
Although the XGBoost-PCA model performed best, the XGBoost model trained on the original dataset will be used to understand how the features impact the model's decisions and interact with one another. It is simpler and more useful to analyse the original features than the PCA components.
Using that model, the first step is to take a deeper look at its performance. To do this, the ROC curve has been plotted. An ROC curve illustrates the performance of a binary classifier by varying the classification threshold and assessing the trade-off between the true positive rate (TPR) and the false positive rate (FPR) across all threshold values.
The diagonal line in the plot shows what the results would be for a purely random classifier, whilst the blue curve represents the ROC curve for the XGBoost model. The curve shows very strong model performance, with only a small dip at the start before achieving a near 1.0 true positive rate. This means a very high true positive rate can be achieved with only a small number of false positives. An AUC of 0.99 to two decimal places shows that the model is a near perfect classifier on this metric.
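To make the TPR/FPR trade-off concrete, the toy sketch below (using made-up labels and scores, not values from this dataset) computes a single point on an ROC curve by hand; sweeping the threshold from 1 down to 0 traces out the full curve that scikit-learn's `roc_curve` produces:

```python
# Toy illustration: one point on an ROC curve at a single threshold
y_true = [1, 1, 1, 0, 0, 1, 0, 0]                    # actual labels
y_prob = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]   # predicted P(class=1)

def roc_point(y_true, y_prob, threshold):
    """Return (TPR, FPR) when classifying at the given threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), fp / (fp + tn)

tpr, fpr = roc_point(y_true, y_prob, 0.5)
print(tpr, fpr)  # at threshold 0.5: TPR = 0.75, FPR = 0.25
```

Raising the threshold trades true positives for fewer false positives; a good classifier keeps TPR high while FPR stays low, pushing the curve towards the top-left corner.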
# Additional imports for this section
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_curve, roc_auc_score
# Split the data into features and target
X = data_trans.drop('Diagnosis', axis=1)
y = data_trans['Diagnosis']
# Define the XGBoost model with the best hyperparameters
model = xgb.XGBClassifier(
max_depth=4,
learning_rate=0.02979300275648818,
subsample= 1.0,
n_estimators=579,
colsample_bytree= 0.55,
gamma= 0.005917904874804278,
min_child_weight= 10,
reg_alpha= 4.553986093186758e-08,
reg_lambda= 2.635622296775393e-07,
scale_pos_weight= 4.37625191889474,
objective='binary:logistic',
seed=42
)
# Predict the probabilities of the positive class for each sample using cross-validation
y_prob = cross_val_predict(model, X, y, cv=5, method='predict_proba')[:, 1]
# Compute the false positive rate and true positive rate for different threshold values
fpr, tpr, thresholds = roc_curve(y, y_prob)
# Compute the area under the ROC curve
auc = roc_auc_score(y, y_prob)
# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='b', label=f'ROC curve (AUC = {auc:.2f})')
plt.plot([0, 1], [0, 1], color='r', linestyle='--')
plt.xlim([0, 1])
plt.ylim([0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve (Cross-Validation)')
plt.legend(loc='lower right')
plt.show()
Now that the model's performance has been assessed, the next step is to interpret the output by looking at the importance of, and interaction between, features. To do this, SHAP (SHapley Additive exPlanations) values will be used. SHAP values are a tool that explains the output of a machine learning model by measuring the importance of its features. They are computed by analysing the model's predictions and measuring the contribution of each feature to the difference between the predicted value and the average prediction over all instances. SHAP values can then help with model interpretation and feature selection.
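The additivity property that gives SHAP its name can be seen exactly in the linear case: for a linear model with independently treated features, the SHAP value of feature j is simply w_j(x_j − E[x_j]). The sketch below, using made-up data, verifies that the base value plus all SHAP values recovers the model's prediction:

```python
import numpy as np

# For a linear model f(x) = b + w.x with independently treated features,
# the exact SHAP value of feature j for a sample x is w_j * (x_j - mean(X_j)).
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))                # made-up feature matrix
w, b = np.array([2.0, -1.0, 0.5]), 0.3       # made-up model weights

base_value = b + w @ X.mean(axis=0)          # the average model prediction
x = X[0]                                     # one sample to explain
shap_vals = w * (x - X.mean(axis=0))         # per-feature contributions

# Additivity: base value plus all SHAP values recovers the prediction
assert np.isclose(base_value + shap_vals.sum(), b + w @ x)
```

For tree ensembles like XGBoost there is no such closed form, so the `shap` package computes exact values with the TreeSHAP algorithm instead, but the same additivity holds.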
There are many different SHAP plots that can be used, and some of them will be explored below. The first block of code reruns the optimised XGBoost model and uses it to create the SHAP values.
# Import SHAP for model explanation
import shap
# Split the dataset into features and target variable
X = data_trans.drop('Diagnosis', axis=1)
y = data_trans['Diagnosis']
# Define the XGBoost model and its hyperparameters
model = xgb.XGBClassifier(
max_depth=4,
learning_rate=0.02979300275648818,
subsample= 1.0,
n_estimators=579,
colsample_bytree= 0.55,
gamma= 0.005917904874804278,
min_child_weight= 10,
reg_alpha= 4.553986093186758e-08,
reg_lambda= 2.635622296775393e-07,
scale_pos_weight= 4.37625191889474,
objective='binary:logistic',
seed=42
)
# Train the XGBoost model on the entire dataset
model.fit(X, y)
# Compute the SHAP values for each feature using the trained model
explainer = shap.Explainer(model)
shap_values = explainer(X)
The first SHAP plot that will be used is a SHAP summary plot. This shows the impact of each feature on model prediction. A positive SHAP value means the feature influenced the model towards the positive class of diagnosis (a value of 1), whilst a negative SHAP value means the feature influenced the model towards the negative class of diagnosis (a value of 0). Each row is a feature, and the colour of the points represents the value of the feature from high to low.
For example, it is clear from the graph that as Rumination increases in value, the SHAP value moves in a negative direction. This means that, on average, people with high Rumination were more likely to be predicted as a 0 for the diagnosis. The length of the bar indicates the feature's overall importance in model prediction, showing that Rumination and Tension were particularly important features, which operated in opposite directions. This graph provides many useful insights into how each feature impacts model prediction.
# Plot the summary plot of SHAP values
shap.summary_plot(shap_values, X)
The next graph is a SHAP bar plot that measures the mean absolute impact of each feature, giving overall feature importance. From this graph, Rumination and Tension combined were more important for model prediction than the 18 other features in the dataset. This can be incredibly useful information for feature selection, helping to select the most useful features for future models. It can also provide interesting insights into understanding the disease. As this is a fictional dataset and disease, no real insights can be drawn within this work, but the methodology has huge potential to influence future research, potentially uncovering previously unknown important features and feature interactions.
For example, if Rumination was not currently considered important in the disease literature, then results like this could encourage further research in that direction. Understanding the most important features is not just key to improving model performances, but to gaining a greater insight into the domain as a whole.
shap.plots.bar(shap_values)
The next graph shows SHAP interaction values, which explain the contribution of a pair of features to a model's output. Interaction values measure the change in the SHAP value of one feature when the value of another feature changes. This can be used to identify pairs of features that interact with each other and affect the model's output. For example, if the interaction value between feature A and feature B is positive, it means that increasing the value of feature A and feature B together will have a larger effect on the model's output than increasing the value of either feature alone. Conversely, if the interaction value is negative, it means that increasing the value of feature A and feature B together will have a smaller effect than increasing the value of either feature alone.
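The idea behind an interaction value can be sketched with a toy two-feature "game" (the values of v below are invented for illustration); SHAP's interaction values apply this same difference-of-differences calculation, averaged over background feature settings:

```python
# Toy two-feature game to make the interaction definition concrete.
# v(S) is the model output with only the features in S "present".
v = {(): 0.0, ('A',): 1.0, ('B',): 1.0, ('A', 'B'): 3.0}

# Interaction between A and B: how much adding both together differs
# from the sum of adding each alone.
interaction = v[('A', 'B')] - v[('A',)] - v[('B',)] + v[()]
print(interaction)  # 1.0 > 0: A and B reinforce each other
```

Here the pair contributes 3.0 together but only 1.0 + 1.0 separately, so the positive interaction value of 1.0 captures the extra effect that neither feature produces alone.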
The SHAP interaction values can be visualized using a SHAP interaction plot, which shows the interaction effect between pairs of features. The below graph provides an overview of a range of different SHAP interaction plots. Specific details are difficult to extract from a plot like this, but interesting looking interactions can be identified to look at in more detail.
shap_interaction_values = shap.TreeExplainer(model).shap_interaction_values(X)
shap.summary_plot(shap_interaction_values, X)
In the graph above, an easy to identify abnormality is the combination of Rumination and Tension. The graph below looks at this in more detail. There are two clearly different trends for the red and blue curves, representing high and low levels of Tension respectively. For high levels of Tension, as Rumination increases, the SHAP interaction value decreases, meaning the interaction effect between the two features becomes smaller. Conversely, for low levels of Tension, as Rumination increases, the SHAP interaction value increases, meaning the interaction effect becomes greater. This provides an interesting insight: Rumination and Tension were the two most important features in the dataset according to the previous SHAP graphs, yet the impact of each is linked to the other.
This could be very interesting information when studying the disease in the future. Prior to this, it would have been easy to conclude that people with high Rumination are less likely to be diagnosed. However, this result shows that the value of Tension could have a big impact on how the Rumination value manifests itself. This shows how the methodology could be really useful when looking at real datasets and real diseases, to draw potentially important insights.
shap.dependence_plot(
("Rumination", "Tension"),
shap_interaction_values, X,
display_features=X)
The following code plots two additional graphs to show how SHAP interaction values can be looked at for different feature combinations.
shap.dependence_plot(
("Delusion", "Anhedonia"),
shap_interaction_values, X,
display_features=X)
shap.dependence_plot(
("Delusion", "Dep_Mood"),
shap_interaction_values, X,
display_features=X)
The final plot is a SHAP waterfall plot. This shows the decision-making process for a single sample in the dataset. It is an incredibly useful tool as it can provide person-specific explanations. For example, for this particular sample, Tension was the most important feature, rather than Rumination. This means that if the model predicts someone is diagnosed, it can also provide a specific explanation as to why, which is really useful when communicating decisions to individuals, as well as informing clinicians and researchers.
Particularly when studying disease, the ability to provide feedback on a per-patient basis could be very useful in understanding the unique makeup of that individual's disease markers. It also has the potential to inform which additional diagnostic tools should be considered, and potentially to inform treatment options, depending on the specific disease and the markers measured.
shap.plots.waterfall(shap_values[0])
Overall, it is clear to see how SHAP values can be incredibly useful in providing model explainability. They can help understand which features are the most important, how they impact model decision making, as well as how they interact with one another. This provides a complex level of information regarding the systems hidden within the data, and the ability to extract that on a sample level basis provides potentially great real world utility.
In conclusion, the unsupervised and supervised learning methods both provided great insights into the data and how predictive models can be built. The unsupervised methods showed how both Diagnosis outcomes form naturally occurring groupings with large overlaps. The supervised learning methods then showed how the Diagnosis could be predicted using powerful machine learning models. The best performing model was the XGBoost model with PCA, which highlights how combining supervised and unsupervised learning can produce the best models. SHAP values were also shown to be a very useful tool in explaining a model's predictive output.
One limitation of this work was in feature selection. Early on, many features were recognised as candidates for exclusion, due to correlation or other factors, in order to avoid the curse of dimensionality. I tested the supervised models on a range of different feature groupings, and in this instance the best performing model was the one trained on all of the cleaned and transformed features. However, with real data, reducing the number of features is often not just about model performance, but also about simplifying model interpretation and making it easier to gather future data. This is particularly relevant for medical data, where gathering a large number of features could take additional time and cause the patient additional stress and inconvenience. With that in mind, SHAP values could be used in future work to help with feature selection and narrow down the number of features used within the supervised learning models.
With more time, I would have liked to analyse the impact of gender, race, and socio-economic factors within the dataset and how they may bias the results. I would also have applied a greater range of unsupervised and supervised learning methods, potentially including deep learning.
Overall, this analysis provides some strong conclusions, with a very well performing XGBoost model as the result, and demonstrates the power of a wide range of machine learning techniques.